Ray Serve


Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

Yang, Yuting, Yuan, Tiancheng, Hashim, Jamal, Garrett, Thiago, Qian, Jeffrey, Zhang, Ann, Wang, Yifan, Song, Weijia, Birman, Ken

arXiv.org Artificial Intelligence

There is growing interest in deploying ML inference and knowledge retrieval as services that can support both interactive queries from end users and the more demanding request flows that arise when AIs are integrated into end-user applications or deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing requests to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often sustaining a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
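The batching-versus-tail-latency tradeoff the abstract points to can be made concrete with a toy queueing sketch. This is not from the paper: the batch size, arrival rate, and execution time below are arbitrary assumptions. It shows that when a server dispatches only full batches, a request that arrives early in a batch waits for the batch to fill, so its latency depends on the arrival process rather than on the model.

```python
# Toy illustration of batching-induced tail latency (not from the Vortex
# paper). A request's latency = time until its batch fills + execution time.
import random

BATCH_SIZE = 8        # assumed: server dispatches only full batches
MEAN_GAP_MS = 5.0     # assumed mean inter-arrival gap (exponential)
BATCH_EXEC_MS = 20.0  # assumed fixed batch execution time

random.seed(0)
latencies, t, arrivals = [], 0.0, []
for _ in range(10_000):
    t += random.expovariate(1.0 / MEAN_GAP_MS)
    arrivals.append(t)
    if len(arrivals) == BATCH_SIZE:
        done = t + BATCH_EXEC_MS  # batch dispatches only when full
        latencies += [done - a for a in arrivals]
        arrivals = []

latencies.sort()
print(f"p50 = {latencies[len(latencies) // 2]:.1f} ms")
print(f"p99 = {latencies[int(len(latencies) * 0.99)]:.1f} ms")
```

The gap between p50 and p99 comes entirely from variance in how long each batch takes to fill, which is the kind of unpredictability an SLO-first design has to control.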


ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Fu, Yao, Xue, Leyang, Huang, Yeqi, Brabete, Andrei-Octavian, Ustiugov, Dmitrii, Patel, Yuvraj, Mai, Luo

arXiv.org Artificial Intelligence

This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art serverless systems.

Furthermore, LLM inference latency is difficult to predict because response time depends on the output length, which can vary significantly [24, 39, 77] due to iterative output token generation. To achieve low latency, processing an LLM request often necessitates the use of several GPUs for durations ranging from seconds to minutes. In practice, LLM service providers need to host a large number of LLMs catered to different developers, leading to significant GPU consumption [15] and impeding the sustainability of LLM services [19]. As a result, LLM inference services have to impose strict caps on the number of requests sent to their services by their users (e.g., 40 messages per 3 hours for ChatGPT [51]), showing the providers' current inability to satisfy LLM inference demand. Researchers [19] project that LLM inference costs may increase by more than 50x when it reaches the popularity of Google Search. To reduce GPU consumption, LLM service providers are exploring serverless inference, as seen in systems like Amazon SageMaker [60], Azure [46], KServe [11] and HuggingFace [31].
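The third contribution, locality-aware server allocation, lends itself to a small illustration. The Python sketch below is not the paper's code: it assumes a simple cost model in which startup time is queueing delay plus checkpoint-load time, with made-up per-tier bandwidths, and picks the server with the smallest estimate.

```python
# Hedged sketch of locality-aware allocation (not ServerlessLLM's actual
# algorithm): estimate startup time per server from where the checkpoint
# resides, then pick the fastest. All numbers are illustrative assumptions.
from dataclasses import dataclass

TIER_BANDWIDTH_GBPS = {"dram": 25.0, "local_ssd": 5.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    checkpoint_tier: str   # which tier holds this model's checkpoint
    queue_delay_s: float   # estimated wait for a free GPU on this server

def estimated_startup_s(server: Server, model_size_gb: float) -> float:
    load_s = model_size_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return server.queue_delay_s + load_s

def pick_server(servers: list[Server], model_size_gb: float) -> Server:
    return min(servers, key=lambda s: estimated_startup_s(s, model_size_gb))

servers = [
    Server("gpu-1", "remote", queue_delay_s=0.0),     # idle but cold
    Server("gpu-2", "local_ssd", queue_delay_s=2.0),  # busy but warm
]
best = pick_server(servers, model_size_gb=14.0)  # e.g., a 7B model in fp16
print(best.name)  # gpu-2: 2.0 + 14/5 = 4.8 s beats gpu-1: 14/1 = 14 s
```

The point of the sketch is the ordering it captures: waiting briefly for a server with a warm checkpoint can beat an idle server that must fetch the checkpoint remotely.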


Serving ML Models in Production: Common Patterns - KDnuggets

#artificialintelligence

This post is based on Simon Mo's "Patterns of Machine Learning in Production" talk from Ray Summit 2021. Over the past couple of years, we've listened to ML practitioners across many different industries to learn from and improve the tooling around ML production use cases. Through this, we've seen four common patterns of machine learning in production: pipeline, ensemble, business logic, and online learning. In the ML serving space, implementing these patterns typically involves a tradeoff between ease of development and production readiness. Ray Serve was built to support these patterns by being both easy to develop with and production-ready. It is a scalable and programmable serving framework built on top of Ray to help you scale your microservices and ML models in production.
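As a concrete illustration of the pipeline pattern, here is a minimal Ray Serve sketch. It uses the Ray 2.x deployment-composition API (which differs from the 2021-era API the talk used), and the deployment names and toy "model" logic are invented for illustration.

```python
# Minimal sketch of the Ray Serve "pipeline" pattern: a preprocessing
# deployment chained into a model deployment. Names and logic are placeholders.
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    def process(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Classifier:
    def __init__(self, preprocessor: DeploymentHandle):
        self.preprocessor = preprocessor

    async def __call__(self, request) -> dict:
        text = (await request.json())["text"]
        cleaned = await self.preprocessor.process.remote(text)
        label = "positive" if "good" in cleaned else "negative"  # toy model
        return {"label": label}

app = Classifier.bind(Preprocessor.bind())
# serve.run(app)  # then POST {"text": "..."} to http://localhost:8000/
```

Each stage scales independently as its own deployment, which is what distinguishes this pattern from wrapping the whole pipeline in a single service.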


How the Integrations Between Ray & MLflow Aid Distributed ML Production

#artificialintelligence

In this blog post, we're announcing two new integrations between Ray and MLflow: Ray Tune + MLflow Tracking and Ray Serve + MLflow Models, which together make it much easier to build machine learning (ML) models and take them to production. These integrations are available in the latest Ray wheels. You can follow the instructions here to pip install the nightly version of Ray and take a look at the documentation to get started. They will also be in the next Ray release, version 1.2. Our goal is to leverage the strengths of the two projects: Ray's distributed libraries for scaling training and serving, and MLflow's end-to-end model lifecycle management. Let's first take a brief look at what these libraries can do before diving into the new integrations.
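For a sense of what the Ray Tune + MLflow Tracking integration looks like in code, here is a minimal sketch. It follows the Ray 1.2-era module path (ray.tune.integration.mlflow; later releases moved the callback to ray.air.integrations.mlflow), and the training function is a toy placeholder.

```python
# Hedged sketch of Ray Tune logging results to MLflow via a callback.
# The objective function is a stand-in; replace with real training code.
from ray import tune
from ray.tune.integration.mlflow import MLflowLoggerCallback

def train_fn(config):
    score = config["lr"] * 100  # toy objective, not a real model
    tune.report(mean_accuracy=score)  # each report becomes an MLflow metric

tune.run(
    train_fn,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    callbacks=[MLflowLoggerCallback(experiment_name="tune_example")],
)
```

Each Tune trial shows up as its own MLflow run, so hyperparameters and metrics from the whole sweep land in one tracked experiment.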